Tagging Romanian Texts: a Case Study for QTAG, a Language Independent Probabilistic Tagger

Authors

  • Dan Tufis
  • Oliver Mason
Abstract

This paper describes an experiment on tagging Romanian using QTAG, a part-of-speech tagger that was originally developed for English but keeps a clear separation between the (probabilistic) processing engine and the (language-specific) resource data. As a result, the tagger is usable across languages, as shown by successful experiments on three quite different languages: English, Swedish and Romanian. After a brief presentation of the QTAG tagger, the paper discusses the language resources for Romanian and the evaluation of the results. It also proposes a complexity metric for tagging experiments, which relates the performance of a tagger to the “difficulty” of a text.


Similar resources

Probabilistic tagging of minority language data: a case study using Qtag

While probabilistic methods of part-of-speech tag assignment have long received consideration in corpus and computational-linguistic research, less attention would appear to have been paid to date to the development of tagging accuracy over rounds of iterative, interactive training in applications of these methods. Understanding this aspect of probabilistic tagging is arguably of particular imp...


Towards a Bayesian Stochastic Part-Of-Speech and Case Tagger of Natural Language Corpora

This paper introduces and evaluates a Bayesian Network probabilistic model for automatic Part-Of-Speech tagging of Modern Greek natural language texts. The Bayesian model for the task of POS tagging is mathematically formed and is compared to that of Hidden Markov, a broadly applied methodology. Our model is trained from annotated corpora, using lexical as well as contextual information. Unlike...


Adapting the TTL Romanian POS Tagger to the Biomedical Domain

This paper presents the adaptation of the Hidden Markov Models-based TTL part-of-speech tagger to the biomedical domain. TTL is a text processing platform that performs sentence splitting, tokenization, POS tagging, chunking and Named Entity Recognition (NER) for a number of languages, including Romanian. The POS tagging accuracy obtained by the TTL POS tagger exceeds 97% when TTL’s baseline mod...


Bayesian Reinforcement for a Probabilistic Neural Net Part-of-Speech Tagger

The present paper introduces a novel stochastic model for Part-Of-Speech tagging of natural language texts. While previous statistical approaches, such as Hidden Markov Models, are based on theoretical assumptions that are not always met in natural language, we propose a methodology which incorporates fundamental elements of two distinct machine learning disciplines. We make use of Bayesian know...


A Part-of-Speech Tagging System for the Persian Language

Abstract: Part-Of-Speech (POS) tagging is an essential step for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech, automatic speech recognition, etc. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...




Publication date: 1998